Sains Malaysiana 54(8)(2025): 2087-2097
http://doi.org/10.17576/jsm-2025-5408-17
Improved Robust Principal Component Analysis based on Minimum Regularized Covariance Determinant for the Detection of High Leverage Points in High Dimensional Data
(Penambahbaikan Analisis Komponen Utama berdasarkan Penentu Kovarian Teratur Minimum bagi Mengecam Titik Tuasan Tinggi untuk Data Dimensi Tinggi)
HABSHAH MIDI1,2,*, JAAZ SUHAIZA1,3, MOHD ASLAM1,2, HANI SYAHIDA2 & EMI AMIELDA3
1Institute for Mathematical Research, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia
2Department of Mathematics & Statistics, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia
3Faculty of Computing & Multimedia, Universiti Poly-Tech Malaysia, 56100 Cheras, Kuala Lumpur, Malaysia
Received: 22 April 2024/Accepted: 13 March 2025
Abstract
This paper presents an extension of robust principal component analysis (ROBPCA), denoted IRPCA, to improve the accuracy of detecting high leverage points (HLPs) in high dimensional data (HDD). The IRPCA employs principal component analysis (PCA) to reduce the dimension of the data set; robust location and scatter estimates of the PC scores are then obtained from the Minimum Regularized Covariance Determinant (MRCD). Instead of the robust score distance used in ROBPCA, the proposed IRPCA uses the robust Mahalanobis distance (RMD) to detect HLPs. The performance of the IRPCA is compared with that of ROBPCA and of the Minimum Regularized Covariance Determinant and PCA-based method (MRCD-PCA) for the identification of HLPs in HDD. The results indicate that all three methods detect HLPs very successfully, with no masking effect. Nonetheless, ROBPCA suffers from serious swamping when HLPs make up less than 30% of the data. The proposed IRPCA and the MRCD-PCA perform similarly, with very small swamping effects. However, the MRCD-PCA algorithm is quite cumbersome and requires a longer computational running time. The attractive feature of the IRPCA is that its algorithm is simpler and very fast.
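The three steps summarized in the abstract (PCA for dimension reduction, a robust location and scatter fit on the PC scores, then a robust Mahalanobis distance per observation) can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: scikit-learn does not provide MRCD, so `MinCovDet` (the plain MCD estimator) stands in for it on the low-dimensional scores, and the chi-square cutoff, the number of components and the function name `flag_high_leverage` are choices made for this example rather than details taken from the paper.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet
from sklearn.decomposition import PCA


def flag_high_leverage(X, n_components=2, quantile=0.975, seed=0):
    """Return (robust distances, boolean HLP flags) for the rows of X."""
    # Step 1: classical PCA reduces each p-dimensional row to k scores.
    scores = PCA(n_components=n_components).fit_transform(X)
    # Step 2: robust location and scatter of the PC scores.
    # ASSUMPTION: MinCovDet (plain MCD) is used as a stand-in for MRCD.
    mcd = MinCovDet(random_state=seed).fit(scores)
    # Step 3: robust Mahalanobis distance of each score vector.
    # MinCovDet.mahalanobis returns squared distances, hence the sqrt.
    rmd = np.sqrt(mcd.mahalanobis(scores))
    # ASSUMPTION: chi-square cutoff; the paper may use a different rule.
    cutoff = np.sqrt(chi2.ppf(quantile, df=n_components))
    return rmd, rmd > cutoff


# Synthetic high dimensional data (p > n) with five planted HLPs.
rng = np.random.default_rng(0)
n, p = 60, 200
X = rng.normal(size=(n, p))
X[:5] += 10.0  # shift the first five rows far from the bulk
rmd, flags = flag_high_leverage(X)
print("flagged rows:", np.flatnonzero(flags))
```

On this synthetic data the five shifted rows sit far from the bulk along the first principal component, so their robust distances are much larger than the cutoff. Clean rows flagged by mistake would correspond to the swamping effect discussed in the abstract.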
Keywords: High leverage point; minimum regularized covariance determinant; principal component analysis; robust Mahalanobis distance
Abstrak
Kertas ini membentangkan kerja lanjutan bagi Analisis Komponen Utama Teguh (ROBPCA), ditandakan dengan IRPCA, untuk meningkatkan ketepatan pengecaman titik tuasan tinggi (HLPs) dalam data dimensi tinggi (HDD). IRPCA menggunakan Analisis Komponen Utama (PCA) bagi menurunkan dimensi set data dan seterusnya penganggar lokasi dan skala skor PC dikira berdasarkan Penentu Kovarian Teratur Minimum (MRCD). Dengan tidak menggunakan jarak skor teguh untuk pengecaman HLPs seperti ROBPCA, dalam kaedah IRPCA yang dicadangkan, kami telah mempertimbangkan penggunaan Jarak Mahalanobis Teguh (RMD). Prestasi IRPCA yang dicadangkan dibandingkan dengan kaedah ROBPCA dan kaedah Penentu Kovarian Teratur Minimum dan PCA (MRCD-PCA) bagi mengecam HLPs dalam HDD. Keputusan menunjukkan ketiga-tiga kaedah sangat berjaya dalam pengesanan HLPs tanpa kesan penyorokan. Walau bagaimanapun, ROBPCA mengalami masalah kesan limpahan yang serius apabila terdapat HLPs kurang daripada 30%. Prestasi IRPCA yang dicadangkan dan MRCD-PCA adalah sama, mempunyai kesan limpahan yang sangat kecil. Namun begitu, algoritma MRCD-PCA agak rumit dan memerlukan masa pengiraan yang panjang. Sifat menarik bagi IRPCA ialah ia memberi algoritma yang mudah dan masa pengiraan yang singkat.
Kata kunci: Analisis komponen utama; jarak Mahalanobis teguh; penentu kovarian teratur minimum; titik tuasan tinggi
*Corresponding author; email: habshah@upm.edu.my